Attending Category Disentangled Global Context for Image Classification
In this paper, we propose a general framework for image classification that uses
the attention mechanism and global context, and can be incorporated into various
network architectures to improve their performance. To investigate the
capability of the global context, we compare four mathematical models and
observe that the global context encoded by a category-disentangled conditional
generative model provides more guidance, following the intuition that knowing
what is task-irrelevant also reveals what is relevant. Based on this
observation, we define a novel Category Disentangled Global Context (CDGC) and
devise a deep network to obtain it. By attending to the CDGC, baseline networks
identify the objects of interest more accurately, thereby improving performance.
We apply the framework to a variety of network architectures and compare it with
the state of the art on four publicly available datasets. Extensive results
validate the effectiveness and superiority of our approach. Code will be made
public upon paper acceptance.
Comment: Under review
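As a rough illustration of the attending mechanism described above (not the authors' released code), the sketch below gates a backbone feature map with per-channel weights predicted from a global context vector; the module name, tensor shapes, and sigmoid gating are assumptions made only for this example.

# Hypothetical sketch: channel-wise attention driven by a global context vector.
import torch
import torch.nn as nn

class GlobalContextAttention(nn.Module):
    def __init__(self, channels, context_dim):
        super().__init__()
        # project the global context vector to per-channel attention weights
        self.gate = nn.Sequential(nn.Linear(context_dim, channels), nn.Sigmoid())

    def forward(self, feat, context):
        # feat: (B, C, H, W) backbone features; context: (B, D) global context
        weights = self.gate(context).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return feat * weights  # re-weight channels using the attended context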
See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data
Zero-shot point cloud segmentation aims to make deep models capable of
recognizing novel objects in point clouds that are unseen in the training phase.
Recent trends favor pipelines that transfer knowledge from seen classes with
labels to unseen classes without labels. They typically align visual features
with semantic features obtained from word embeddings, supervised by the
annotations of seen classes. However, point clouds contain limited information
for fully matching the semantic features. In fact, the rich appearance
information of images is a natural complement to textureless point clouds, which
has not been well explored in the previous literature. Motivated by this, we
propose a novel multi-modal zero-shot learning method to better utilize the
complementary information of point clouds and images for more accurate
visual-semantic alignment. Extensive experiments are performed on two popular
benchmarks, i.e., SemanticKITTI and nuScenes, and our method outperforms current
state-of-the-art methods by 52% and 49% on average in unseen-class mIoU,
respectively.
Comment: Accepted by ICCV 2023
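For readers unfamiliar with the visual-semantic alignment mentioned above, the following minimal sketch (an assumption, not taken from the paper) scores per-point visual features against class word embeddings by cosine similarity; the temperature value is arbitrary.

# Hypothetical sketch: zero-shot classification of point features against
# class word embeddings via cosine similarity.
import torch
import torch.nn.functional as F

def semantic_logits(point_feats, class_embeddings, temperature=0.07):
    # point_feats: (N, D) per-point visual features
    # class_embeddings: (K, D) word embeddings of class names (seen + unseen)
    point_feats = F.normalize(point_feats, dim=-1)
    class_embeddings = F.normalize(class_embeddings, dim=-1)
    return point_feats @ class_embeddings.t() / temperature  # (N, K) logits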
Detecting Abrupt Change of Channel Covariance Matrix in IRS-Assisted Communication
The knowledge of channel covariance matrices is crucial to the design of
intelligent reflecting surface (IRS) assisted communication. However, channel
covariance matrices may change abruptly in practice. This letter focuses on
detecting such changes in IRS-assisted communication. Specifically, we consider
an uplink communication system consisting of a single-antenna user (UE), an IRS,
and a multi-antenna base station (BS). We first categorize two types of channel
covariance matrix change based on their impact on system design: a Type I
change, which denotes a change in the BS receive covariance matrix, and a Type
II change, which denotes a change in the IRS transmit/receive covariance matrix.
Second, we propose a method to detect whether a Type I change, a Type II change,
or no change has occurred. The effectiveness of the proposed scheme is verified
by numerical results.
Comment: Accepted by IEEE Wireless Communications Letters
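The letter's actual detector is not reproduced here; as a hedged illustration of the general idea, the sketch below flags a change by thresholding the normalized Frobenius distance between a freshly estimated sample covariance and a reference covariance, with the block size and threshold being assumptions.

# Hypothetical sketch: simple covariance-change test from received pilot blocks.
import numpy as np

def covariance_changed(y_block, cov_ref, threshold=0.3):
    # y_block: (T, M) received pilot snapshots at an M-antenna BS
    # cov_ref: (M, M) reference covariance estimated before the current block
    cov_new = (y_block.conj().T @ y_block) / y_block.shape[0]  # sample covariance
    dist = np.linalg.norm(cov_new - cov_ref, 'fro') / np.linalg.norm(cov_ref, 'fro')
    return dist > threshold  # True if an abrupt change is declared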
Segment Any Point Cloud Sequences by Distilling Vision Foundation Models
Recent advancements in vision foundation models (VFMs) have opened up new
possibilities for versatile and efficient visual perception. In this work, we
introduce Seal, a novel framework that harnesses VFMs for segmenting diverse
automotive point cloud sequences. Seal exhibits three appealing properties: i)
Scalability: VFMs are directly distilled into point clouds, obviating the need
for annotations in either 2D or 3D during pretraining. ii) Consistency: Spatial
and temporal relationships are enforced at both the camera-to-LiDAR and
point-to-segment regularization stages, facilitating cross-modal representation
learning. iii) Generalizability: Seal enables knowledge transfer in an
off-the-shelf manner to downstream tasks involving diverse point clouds,
including those from real/synthetic, low/high-resolution, large/small-scale,
and clean/corrupted datasets. Extensive experiments conducted on eleven
different point cloud datasets showcase the effectiveness and superiority of
Seal. Notably, Seal achieves a remarkable 45.0% mIoU on nuScenes after linear
probing, surpassing random initialization by 36.9% mIoU and outperforming prior
arts by 6.1% mIoU. Moreover, Seal demonstrates significant performance gains
over existing methods across 20 different few-shot fine-tuning tasks on all
eleven tested point cloud datasets.
Comment: NeurIPS 2023 (Spotlight); 37 pages, 16 figures, 15 tables; Code at
https://github.com/youquanl/Segment-Any-Point-Cloud
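As an illustrative sketch of camera-to-LiDAR distillation in general (not the released Seal implementation), the snippet below matches each projected point's frozen 2D VFM feature with its 3D point feature via a cosine-similarity loss; the tensor shapes and the pixel-coordinate input are assumptions.

# Hypothetical sketch: distill frozen 2D VFM features into a 3D point backbone.
import torch
import torch.nn.functional as F

def camera_to_lidar_distill_loss(point_feats, image_feats, uv):
    # point_feats: (N, D) features from the 3D network
    # image_feats: (D, H, W) frozen VFM feature map of the paired camera image
    # uv: (N, 2) integer pixel coordinates of each point projected into the image
    sampled = image_feats[:, uv[:, 1], uv[:, 0]].t()  # (N, D) 2D targets
    return 1.0 - F.cosine_similarity(point_feats, sampled, dim=-1).mean()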
Towards Label-free Scene Understanding by Vision Foundation Models
Vision foundation models such as Contrastive Vision-Language Pre-training
(CLIP) and Segment Anything (SAM) have demonstrated impressive zero-shot
performance on image classification and segmentation tasks. However, the
incorporation of CLIP and SAM for label-free scene understanding has yet to be
explored. In this paper, we investigate the potential of vision foundation
models in enabling networks to comprehend 2D and 3D worlds without labelled
data. The primary challenge lies in effectively supervising networks under
extremely noisy pseudo labels, which are generated by CLIP and further
exacerbated during the propagation from the 2D to the 3D domain. To tackle
these challenges, we propose a novel Cross-modality Noisy Supervision (CNS)
method that leverages the strengths of CLIP and SAM to supervise 2D and 3D
networks simultaneously. In particular, we introduce a prediction consistency
regularization to co-train the 2D and 3D networks, and further impose
latent-space consistency between them using SAM's robust feature representation.
Experiments conducted on diverse indoor and outdoor datasets demonstrate the
superior performance of our method in understanding 2D and 3D open environments.
Our 2D and 3D networks achieve label-free semantic segmentation with 28.4% and
33.5% mIoU on ScanNet, improvements of 4.7% and 7.9%, respectively. On the
nuScenes dataset, our method reaches 26.8% mIoU, an improvement of 6%. Code will
be released at https://github.com/runnanchen/Label-Free-Scene-Understanding.
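A minimal sketch of a prediction consistency term between paired 2D and 3D predictions is shown below; it is an assumption-based illustration using a symmetric KL divergence, not the paper's CNS implementation.

# Hypothetical sketch: symmetric consistency between 2D and 3D class predictions
# at pixel-point correspondences.
import torch.nn.functional as F

def prediction_consistency(logits_2d, logits_3d):
    # logits_2d, logits_3d: (N, K) class logits at N paired pixel/point locations
    p2d = F.log_softmax(logits_2d, dim=-1)
    p3d = F.log_softmax(logits_3d, dim=-1)
    kl_a = F.kl_div(p2d, p3d.exp(), reduction='batchmean')
    kl_b = F.kl_div(p3d, p2d.exp(), reduction='batchmean')
    return 0.5 * (kl_a + kl_b)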
Robo3D: Towards Robust and Reliable 3D Perception against Corruptions
The robustness of 3D perception systems under natural corruptions from
environments and sensors is pivotal for safety-critical applications. Existing
large-scale 3D perception datasets often contain data that are meticulously
cleaned. Such configurations, however, cannot reflect the reliability of
perception models during the deployment stage. In this work, we present Robo3D,
the first comprehensive benchmark for probing the robustness of 3D detectors and
segmentors under out-of-distribution scenarios against natural corruptions that
occur in real-world environments. Specifically, we consider eight corruption
types stemming from adverse weather conditions, external disturbances, and
internal sensor failures. We uncover that, although promising results have been
progressively achieved on standard benchmarks, state-of-the-art 3D perception
models remain vulnerable to such corruptions. We draw key observations on how
choices of data representation, augmentation scheme, and training strategy can
severely affect model performance. To pursue better robustness, we propose a
density-insensitive training framework along with a simple yet flexible
voxelization strategy to enhance model resiliency. We hope our benchmark and
approach can inspire future research on designing more robust and reliable 3D
perception models. Our robustness benchmark suite is publicly available.
Comment: 33 pages, 26 figures, 26 tables; code at
https://github.com/ldkong1205/Robo3D; project page at https://ldkong.com/Robo3D
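As a hedged illustration related to density-insensitive training (not the benchmark's corruption or training code), the sketch below randomly drops a fraction of LiDAR points during training; the drop ratio is an assumed hyperparameter.

# Hypothetical sketch: point-density augmentation for robustness training.
import numpy as np

def random_point_drop(points, drop_ratio=0.5):
    # points: (N, C) LiDAR point array (x, y, z, intensity, ...)
    keep = np.random.rand(points.shape[0]) >= drop_ratio
    return points[keep]  # sparser copy of the scan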
Rethinking Range View Representation for LiDAR Segmentation
LiDAR segmentation is crucial for autonomous driving perception. Recent
trends favor point- or voxel-based methods as they often yield better
performance than the traditional range view representation. In this work, we
unveil several key factors in building powerful range view models. We observe
that the "many-to-one" mapping, semantic incoherence, and shape deformation are
possible impediments against effective learning from range view projections. We
present RangeFormer -- a full-cycle framework comprising novel designs across
network architecture, data augmentation, and post-processing -- that better
handles the learning and processing of LiDAR point clouds from the range view.
We further introduce a Scalable Training from Range view (STR) strategy that
trains on arbitrary low-resolution 2D range images, while still maintaining
satisfactory 3D segmentation accuracy. We show that, for the first time, a
range view method is able to surpass the point, voxel, and multi-view fusion
counterparts on the competitive LiDAR semantic and panoptic segmentation
benchmarks, i.e., SemanticKITTI, nuScenes, and ScribbleKITTI.
Comment: ICCV 2023; 24 pages, 10 figures, 14 tables; Webpage at
https://ldkong.com/RangeFormer
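For context on the range view representation discussed above, the following sketch performs the standard spherical projection of a LiDAR scan into a 2D range image; the image size and vertical field of view are sensor-dependent assumptions, and this is not RangeFormer's code.

# Hypothetical sketch: spherical (range view) projection of a LiDAR point cloud.
import numpy as np

def range_projection(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    # points: (N, 3) xyz coordinates of one LiDAR scan
    depth = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(points[:, 2] / np.maximum(depth, 1e-8))
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    u = 0.5 * (1.0 - yaw / np.pi) * W                          # horizontal pixel
    v = (1.0 - (pitch - fov_down) / (fov_up - fov_down)) * H   # vertical pixel
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)
    range_image = np.full((H, W), -1.0, dtype=np.float32)      # -1 marks empty pixels
    range_image[v, u] = depth                                  # "many-to-one": later points overwrite
    return range_image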